There are a number of general-purpose
microprocessor architectures which, while not
designed for high-end signal processing, might
provide the processing performance required for
complex radars, signal intelligence and other
demanding applications. But how well does each
really perform as a digital signal processor?

To answer this question, some simple
benchmarks were run on a 1 GHz Freescale
7447 PowerPC, a 1.8 GHz IBM 970 PowerPC, a
1.8 GHz AMD Opteron and an 800 MHz MIPS-based
Broadcom BCM1250.

The bottom line of this set of benchmarks is that
the PowerPC with AltiVec produces impressive
computational performance compared to the
other processors considered. Now that IBM is
shipping its PowerPC 970 with AltiVec, there is a
processor alternative that addresses the
memory bandwidth limitations of the 7447.

Yet, despite the strengths of AltiVec, the
benchmarks revealed that the alternative
processors offer some interesting capabilities for
particular types of signal processing: for
example, where memory bandwidth is more
important than sheer speed, or where parts
count is a limitation.

MEMORY READ BANDWIDTH

To measure the memory read bandwidth of the
processors considered, a trivial vector-sum
computation was developed. In this simple
benchmark, as well as in others, all of the
processors suffer a definite step down in
bandwidth when vector length exceeds the L1
cache size, requiring access to L2 cache.
Likewise, performance further degrades when a
vector exceeds the size of the L2 cache and an
access to DRAM main memory is required.

The benchmark operation consisted of summing
the first byte of every 32-byte cache line and
storing the result in a register, discarding most of
the data from the cache line. This “for-loop”
methodology was chosen because the
benchmark is intended to measure bandwidth,
not computational performance.
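
The access pattern described above can be sketched in C as follows. This is a minimal illustration rather than the benchmark's actual source: the names sum_first_bytes, bytes_fetched and CACHE_LINE_BYTES are assumptions introduced here, and the 32-byte line size is simply the figure quoted in the text (the actual line size differs between processors).

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32  /* line size quoted in the text (assumption) */

/* Sum the first byte of every cache line in buf[0..len).
 * One load per line forces the whole line to be fetched, so the run
 * time tracks read bandwidth rather than arithmetic throughput. */
uint32_t sum_first_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;  /* the accumulator can live in a register */
    for (size_t i = 0; i < len; i += CACHE_LINE_BYTES)
        sum += buf[i];
    return sum;
}

/* Bytes of cache-line traffic generated by one pass over len bytes:
 * the number of lines touched times the line size. */
size_t bytes_fetched(size_t len)
{
    size_t lines = (len + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    return lines * CACHE_LINE_BYTES;
}
```

Timing repeated passes over buffers of increasing length and dividing bytes_fetched(len) by the elapsed time per pass would give a bandwidth curve of the kind discussed here. The returned sum should be consumed by the caller so that the compiler cannot discard the loop.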

As might be expected, the 800 MHz Broadcom
BCM1250, with the lowest operating frequency
of the group, also has the lowest bandwidth,
whether the access is to L1 or L2 cache. Despite
the fact that this dual-processor chip has
integrated memory controllers, it still lags behind
the other processors when accessing DRAM.


The change in performance of the PowerPC
7447 is quite clear as the vector size overflows
the L1 cache. The change is almost as dramatic
when the L2 cache overflows, though
performance for 512-Kbyte vectors is lower
than expected. The surprises in this benchmark
lay in the behavior of the Opteron and PowerPC
970 processors, both 1.8 GHz parts.

The Opteron chip, for example, has by far the
best bandwidth of the group when operating out
of L1 cache, but its DRAM bandwidth is only
marginally better than the alternatives, and its L2
bandwidth lags all of the other processors
except the Broadcom BCM1250.

The biggest surprise, however, lay in the
behavior of the 970, which has a very fast clock.
The 970 had the second slowest L1 bandwidth
of the group, despite having almost twice the
operating frequency of the 1 GHz PowerPC
7447, for example. The reason for this appears
to be the rather deep pipeline of the 970 and
the trivial nature of this benchmark. More
complicated tests enable the 970 to perform
better when compared to other processors.

On the other hand, the benchmark results
clearly showed the superior efficiency of the
970’s L2 cache and automatic pre-fetch engines.
The bandwidth falloff between L1 and L2 caches
of this processor is quite minor, whereas the
bandwidth of all the other processors in the
group falls substantially when vector length
forces an L2 access. The 970’s pre-fetch
engines analyze the memory access behavior of
the application and will start fetching data from
memory before the application requests it if the
accesses are regular enough.

MEMORY READ BANDWIDTH WITH PRE-FETCH

All of the processor architectures considered
have some programmable pre-fetch capabilities.
These allow the application to predict future data
requests and issue “touch” instructions to ask
the processor to fulfill the requests in advance. A
pre-fetch factor of 3 was selected for this third
benchmark, the factor being chosen somewhat
arbitrarily: touches are issued 3 loop iterations
ahead.
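
A software pre-fetch version of the read loop might look like the sketch below. It assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch stands in here for the processor-specific touch instruction (dcbt on PowerPC, for example). The function name and the PREFETCH_AHEAD constant are hypothetical; the look-ahead of 3 iterations and the 32-byte line size are taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32  /* line size quoted in the text (assumption) */
#define PREFETCH_AHEAD   3   /* pre-fetch factor: loop iterations of look-ahead */

/* Sum the first byte of every cache line, touching the line that will be
 * needed PREFETCH_AHEAD iterations later before using the current one. */
uint32_t sum_first_bytes_prefetch(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE_BYTES) {
        size_t ahead = i + (size_t)PREFETCH_AHEAD * CACHE_LINE_BYTES;
        if (ahead < len)
            __builtin_prefetch(buf + ahead, 0, 3);  /* 0 = read, 3 = keep in cache */
        sum += buf[i];
    }
    return sum;
}
```

The touch is only a hint and does not change the result, so the function returns the same sum as a loop without pre-fetch. Whether it runs faster or slower depends on the processor and on whether the data is already in cache.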


This benchmark modification had little effect on
the behavior of the 970, which has built-in
engines for predicting memory requests and is
always doing work to optimize its memory
bandwidth. Nor did the modification much affect
the performance of the Opteron. The 7447s,
however, suffered a serious slowdown when the
vector fit into cache.

This benchmark illustrated an important
capability of the BCM1250 chip: using
pre-fetch produced dramatic improvements in L2
and DRAM bandwidth. DRAM bandwidth, for
example, went up by a factor of 6. Performance
approaching the 7447 bandwidth is possible with
additional pre-fetch.

As for the 7447, although there are some
dependencies on the system controller chip
used, the general lesson is that the programmer
needs to be careful with pre-fetch. The
advisability of pre-fetching will depend on the
algorithm.

DIGITAL SIGNAL PROCESSING

The final benchmark reported here reveals how
three of the processors perform when running a
simple signal processing application. For this
test, the assumed source is a sensor such as a
radar receiver, whose output has been digitized
to provide 16-bit integer data.


[Figure: Signal Processing]


In the benchmark, the data is converted to float,
then a forward FFT is performed, followed by a
vector multiply and an inverse FFT. This
resembles “pulse compression” in radar, where a
convolution is performed on the input data, or a
frequency domain filter used in signal
intelligence. The shape of these curves and the
relative performance of the processors are
dominated by the FFT performance. AltiVec
provides a clear advantage here.
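
The four steps of this processing chain can be sketched in plain scalar C as shown below. This is an illustration under stated assumptions, not the benchmark code: it uses a textbook radix-2 FFT rather than an AltiVec or other SIMD library, treats the input as real-valued samples, requires the length n to be a power of two, and expects the caller to supply both the reference spectrum and a twiddle table holding exp(-2*pi*i*j/n) for j = 0 to n/2 - 1. All function and parameter names are hypothetical.

```c
#include <complex.h>
#include <stddef.h>
#include <stdint.h>

/* In-place radix-2 forward FFT. n must be a power of two and
 * twiddle[j] must hold exp(-2*pi*i*j/n) for j = 0 .. n/2 - 1. */
void fft_forward(float complex *x, size_t n, const float complex *twiddle)
{
    for (size_t i = 1, j = 0; i < n; i++) {      /* bit-reversal reorder */
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
    for (size_t len = 2; len <= n; len <<= 1) {  /* butterfly stages */
        size_t half = len / 2, step = n / len;
        for (size_t i = 0; i < n; i += len) {
            for (size_t k = 0; k < half; k++) {
                float complex u = x[i + k];
                float complex v = x[i + k + half] * twiddle[k * step];
                x[i + k] = u + v;
                x[i + k + half] = u - v;
            }
        }
    }
}

/* Inverse FFT built from the forward FFT: transform, reverse the order
 * of elements 1 .. n-1, then scale by 1/n. */
void fft_inverse(float complex *x, size_t n, const float complex *twiddle)
{
    fft_forward(x, n, twiddle);
    for (size_t i = 1, j = n - 1; i < j; i++, j--) {
        float complex t = x[i];
        x[i] = x[j];
        x[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        x[i] /= (float)n;
}

/* Frequency-domain filtering of 16-bit samples; out receives the
 * circular convolution of the input with the reference waveform. */
void pulse_compress(const int16_t *samples, const float complex *ref_spectrum,
                    const float complex *twiddle, float complex *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (float)samples[i];      /* 1. convert integer data to float */
    fft_forward(out, n, twiddle);        /* 2. forward FFT */
    for (size_t i = 0; i < n; i++)
        out[i] *= ref_spectrum[i];       /* 3. complex vector multiply */
    fft_inverse(out, n, twiddle);        /* 4. inverse FFT */
}
```

The twiddle table could be filled once at setup, for instance with cexpf from complex.h. A production implementation would normally replace the two FFT routines with a tuned, vectorized library, which is where the processors compared here differ most.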
Evaluate various general-purpose processors in
radar signal processing:

- Requires processing of complex data types
- All processors have SIMD capability

Freescale 744x (G4) is popular, but several
potential alternatives have become available
recently.

NOT evaluated for this project: DSP chips, FPGA
solutions.
                    AMD Opteron            Broadcom BCM1250         IBM PowerPC 970        Motorola/Freescale
                                                                                           PowerPC 7447 (G4)

Frequency           1.8 GHz                800 MHz                  1.8 GHz                1.0 GHz

On-chip cache (L1)  64 Kbytes              32 Kbytes                32 Kbytes              32 Kbytes

On-chip cache (L2)  1024 Kbytes            512 Kbytes (shared       512 Kbytes             512 Kbytes
                                           by processors)

L3 cache            None                   None                     None                   None

Memory controller   Dual-channel on-chip   Dual-channel on-chip     Apple dual channel,    Discovery II
                                                                    external               single channel

DRAM                128 bits, 133 MHz DDR  128 bits, 133 MHz DDR    128 bits, 200 MHz DDR  64 bits, 133 MHz DDR
                    (PC2100)               (PC2100)                 (PC3200)               (PC2100)

Comments            -                      Dual processors on chip  Apple dual-processor   -
                                                                    SMP configuration
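
The DRAM row of the table implies a theoretical peak transfer rate for each system, which helps put the measured DRAM bandwidths in context. The helper below is a hypothetical sketch of that arithmetic, assuming only that a DDR interface moves data twice per clock cycle; it uses the nominal clock figures from the table and yields upper bounds, not measured results.

```c
#include <stdint.h>

/* Theoretical peak DRAM bandwidth in Mbytes/s for a DDR interface:
 * bus width in bytes x clock in MHz x 2 transfers per clock. */
uint32_t ddr_peak_mbytes_per_s(uint32_t bus_bits, uint32_t clock_mhz)
{
    return (bus_bits / 8u) * clock_mhz * 2u;
}
```

With the table's values, ddr_peak_mbytes_per_s(128, 133) gives 4256 Mbytes/s for the Opteron and BCM1250, ddr_peak_mbytes_per_s(128, 200) gives 6400 Mbytes/s for the 970, and ddr_peak_mbytes_per_s(64, 133) gives 2128 Mbytes/s for the 7447.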
[Figure: Read Bandwidth - Unrolled]
[Figure: Read Bandwidth with Pre-Fetch]
[Figure: Complex Vector Multiply using SIMD]
[Figure legend: 1.8 GHz Opteron, 1 GHz 7447, 1.8 GHz 970; horizontal axis: Length (K elements)]